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Abstract 

The paper considers model selection in regression under the additional structural constraints 
on admissible models where the number of potential predictors might be even larger than the 
available sample size. We develop a Bayesian formalism which is used as a natural tool for 
generating a wide class of model selection criteria based on penalized least squares estimation 
with various complexity penalties associated with a prior on a model size. The resulting criteria 
are adaptive to structural constraints. We establish the upper bound for the quadratic risk of 
the resulting MAP estimator and the corresponding lower bound for the minimax risk over a 
set of admissible models of a given size. We then specify the class of priors (and, therefore, the 
class of complexity penalties) where for the "nearly-orthogonal" design the MAP estimator is 
asymptotically at least nearly-minimax (up to a log-factor) simultaneously over an entire range 
of sparse and dense setups. Moreover, when the numbers of admissible models are "small" (e.g., 
ordered variable selection) or, on the opposite, for the case of complete variable selection, the 
proposed estimator achieves the exact minimax rates. 
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1 Introduction 



Consider the standard Gaussian linear regression model 



y = X(3 + e 



(1) 



where y G M n is a vector of the observed response variable Y, X nxp is the design matrix of the p 
explanatory variables (predictors) X\, . . . , X p , (3 £ W is a vector of unknown regression coefficients, 
e ~ JV(0, a 2 I n ) and the noise variance a 2 is assumed to be known. 

A variety of statistical applications of regression models in different fields nowadays involves a 
vast number of potential predictors. Moreover, p might be even large relative to the amount of 
available data n (p S> n setup) that raises a severe "curse of dimensionality" problem. However, 
typically only some of the predictors have a truly relevant impact on the response y. Model 
(variable) selection by identifying the "best" sparse subset of these "significant" predictors becomes 
therefore crucial in the analysis of such large data sets. For a selected model (a subset of predictors) 
M, the corresponding coefficients (3 M are then typically estimated by least squares. 

The goodness of model selection depends on the particular goal at hand. One should distinguish, 
for example, between estimation of regression coefficients (3, estimation of the mean vector X(3, 
model identification and predicting future observations. Different goals may lead to different 
optimal model selection procedures especially when the number of potential predictors p might 
be much larger than the sample size n. In this paper we consider mainly the estimation of the 
mean vector X(3 and the goodness of a model M is measured by the quadratic risk E\ \X(3 M — X(3\ \ 2 , 
where (3 M is the least squares estimate of (3 for M. The "best" model then is the one with the 
minimal quadratic risk. Note that the true underlying model in (pQ) is not necessarily the best in 
this sense since sometimes it is possible to reduce its risk by excluding predictors with small (but 
still nonzero!) coefficients. 

Minimum quadratic risk criterion for model selection is evidently impossible to implement since 
it involves the unknown true (3 but the corresponding ideal minimal (oracle) risk can be used as a 
benchmark for any available model selection procedure. Typical model selection criteria are based 
on minimizing the empirical quadratic risk ||y — X/3 M \\ 2 , which is the least squares, penalized by 
a complexity penalty Pen(\M\) increasing with a model size \M\: 



The properties of the resulting penalized least squares estimator depend obviously on the 
particular choice of the complexity penalty Pen(-) in ([2]). The most commonly used choice is 
a linear type penalty of the form Pen(k) = 2a 2 Xk for some A > 0. The most known examples 
include C p (Mallows, 1973) and AIC (Akaike, 1973) for A = 1, BIC (Schwarz, 1978) for A = (lnn)/2 
and RIC (Foster & George, 1994) for A = In p. A series of recent works proposed the so-called 
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2k ln(p/&)-type nonlinear complexity penalties of the form Pen(k) = 2a 2 Xk(lia(p/k) + 1)(1 + o(l)) 
with A > 1. (see, e.g., Birge & Massart, 2001, 2007; Abramovich et al, 2006; Bunea, Tsybakov & 
Wegkamp, 2007; Abramovich & Grinshtein, 2010; Rigollet & Tsybakov, 2011). 

As we have mentioned, in the analysis of large complex data sets it is typically believed that 
the underlying (unknown) model is sparse, where a natural measure of model's sparsity is its size 
Pq. Abramovich & Grinshtein (2010) and Rigollet & Tsybakov (2011) established the minimax 
rates for the quadratic risk of estimating the mean vector Xf3 in (TjQ) over the set of models of sizes 
at most pq (see also analogous results of Raskutti, Wainwright & Yu, 2012 for the case when X 
is of a full rank, i.e. rank(X) = min(p, n)). They also showed that for 2k\n(p/k)-type penalties, 
the resulting penalized estimators (J2|) are simultaneously asymptotically optimal (in the minimax 
sense) for the entire range of sparse and dense models, while linear penalties cannot achieve such 
a wide optimality range. 

So far, the above minimax properties of various model selection procedures in ([1]) have been 
established for complete variable selection, where the set of admissible models contains all 2 P subsets 
of the predictors X\, ...,X p . However, in a variety of regression setups there exist additional 
structural constraints that restrict the set of admissible models. In some cases predictors have 
some natural order and Xj can enter the model only if Xj—i is already there (ordered variable 
selection). For example, in polynomial regression, higher order polynomials are usually considered 
only if polynomials of lower degrees are already in the model. In tree-based models, a predictor 
cannot enter a model unless all its ancestors are there (hierarchical model selection). Another 
important example of hierarchical model selection is models with interactions, where interactions 
are typically selected only with the corresponding main effects. For a model with factor predictors, 
each factor predictor with k levels is, in fact, associated with a group of k — 1 indicator (dummy) 
predictors. In this case either none or all of this group can be selected. It is somewhat similar to 
group sparse models, where predictors are splitted into pre-defined groups (see, e.g. Lounici at al., 
2011). 

In this paper we investigate the minimaxity properties of model selection criteria in ([2]) over 
classes of sparse and dense models under additional general structural constraints extending the 
existing results for the complete variable selection. The key point is to adapt the choice of the 
complexity penalty in (J2j) to the specific structural constraints. In fact, it turns out that it is only 
the number of admissible models of a given size that matters. 

We extend a Bayesian approach to model selection developed in Abramovich & Grinshtein 
(2010) for the complete variable selection. The proposed Bayesian formalism is based on imposing 
a prior on a model size, where the penalty term in ([2]) is then naturally treated as proportional to 
its logarithm. From the Bayesian perspective, the model selection criterion ([2]) corresponds thus 
to the maximum a posteriori (MAP) Bayes rule. Depending on a specific choice of a prior, it 
implies a variety of penalized least squares estimators (or, equivalently, model selection criteria) 
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with different complexity penalties. We show that under mild conditions on the prior, the resulting 
MAP estimator falls within a general class of penalized least squares estimators ([2]) considered 
in Birge & Massart (2001, 2007) with complexity penalties Pen(\M\) satisfying certain technical 
conditions depending on the number of admissible models of size \M\, that allows us to derive the 
upper bound for its quadratic risk. On the other hand, we establish the corresponding lower bound 
for the minimax quadratic risk over a set of models of a given size under structural constraints. 
We then specify the class of priors (and, therefore, the class of the corresponding complexity 
penalties), where for the "nearly-orthogonal" design, the resulting MAP estimators asymptotically 
are at least nearly-minimax (up to a log-factor) simultaneously over the wide range of sparse and 
dense models. In particular, when the numbers of admissible models are "small" (e.g., ordered 
variable selection) and for complete variable selection they lead to AlC-type and 2k ln(p/fc)-type 
estimators respectively and achieve the exact minimax rate. 

The paper is organized as follows. In Section [2] we develop a Bayesian formalism for model 
selection in regression under structural constraints and derive the MAP estimator. Its theoretical 
properties are presented in Section [3l In particular, we establish the upper bound for its quadratic 
risk, the corresponding lower bound for the minimax risk and discuss its asymptotic minimaxity 
in various setups. Section 0] presents the results of a simulation study. Concluding remarks are 
summarized in Section [5l while all the proofs are given in the Appendix. 

2 MAP model selection procedure under structural constraints 

We first extend the Bayesian formalism for the model selection in linear regression developed by 
Abramovich & Grinshtein (2010) to structural constraints. We assume that the latter are known 
and the set of all admissible models is therefore fixed. 

Consider the linear regression model (pQ), where the number of possible predictors p might be 
even larger then the number of observations n. Let r = rank(X)(< mm(p,n)) and assume that 
any r columns of X are linearly independent. For the "standard" linear regression setup, where all 
p predictors are linearly independent and there are at least p linearly independent design points, 
r = p. 

Any model M is uniquely defined by the p x p diagonal indicator matrix Dm = diag(d.M), 
where d,Mj = ^-{Xj £ M} and, therefore, \M\ = tr(DM)- For a given model M, we estimate 
its coefficients by least square estimator (3 M = (D m X' X Dm) + D m X'y , where "+" denotes the 
generalized inverse matrix. 

Let m(po) be the number of all admissible models of size pq. The case po = corresponds to a 
null model with a single intercept and, therefore, m(0) = 1. In fact, we can consider only po < r 
since otherwise, there necessarily exists another vector (3* with at most r nonzero entries such that 
X(3 = Xf3*. Although for po = r there may be several different admissible models, all of them 
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are evidently undistinguishable for estimating X(3 and can be associated with a single (saturated) 
model. Thus, without loss of generality, we can always assume that m(r) = 1. Obviously, for 
any p , < m(po) < (p Q ), where the two extreme cases m(po) = 1 and m(po) = y^j for all 
Po = 0, . . . , r — 1, correspond respectively to the ordered and complete variable selection. 

We start from imposing a prior on the model size ir(k) = P(\M\ = k), k = 0, . . . , r. Obviously, 
ir(k) = and P(M| \M\ = k) = iff there are no admissible models of a size k, i.e. m(k) = 0. 
For m(k) > 0, we assume all m(k) admissible models of a given size k to be equally likely, that is, 
conditionally on the model size \M\ = k, 

P(M\ \M\ =k) = m(k)- 1 , 

where recall that m(r) = 1 and hence P(M\ \M\ = r) = 1. To complete the prior, for any 
given model M we assume the normal prior on its unknown coefficients (3 M : (3 M = (3\M ~ 
N p (0, ^^(DmX' XDm) + ) which is the well-known conventional (/-prior of Zellner (1986). 

For the proposed hierarchical prior, straightforward Bayesian calculus yields the posterior 
probability of a model M: 

d/^i ^ n^n , / 7 y'XD M (D M X'XD M )+D M X'y\ 
P(M|y)oc ir(\M\)m(\M\) + * xexpj-^ * — '- j, (3) 

where we set 7r(|M|)m(|M|) _1 = if m(|M|) = 0. Finding the most likely model leads therefore 
to the following maximum a posteriori (MAP) model selection criterion: 

mzK{y'XD M (D M X'XD M ) + D M X'y + 2a\l + l/ 7 ) In [m{\M\)-\{\M\){l + 7)^}} 

or, equivalently, 

min{||y - XP M \\ 2 + 2a 2 (l + l/ 7 ) In [m(\M\)^(\M\)- l (l + 7)^}} (4) 
which is of the general type ([2|) with the complexity penalty 

Pen{\M\) = 2(j 2 (l + l^lnjmdMDvrdMD-^l +7)^} (5) 
A specific form of the penalty © is defined by the choice of the prior ir(-) on the model size. 

3 Main results 
3.1 Risk bounds 

Denote the set of all admissible models by f2. For a given model size 1 < po < r, define the set of 
all admissible models Ai po of size po, that is, M po = {M G $7 : |M| = po} and card(Ai P0 ) = m(po). 
Obviously, if a model M € Mp , the corresponding coefficients vector (3 M € MP has po nonzero 
components, where /3m j 7^ iff Xj is included in M. Let Wl Po = \J^° =0 .Mk be the set of all 
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admissible models with at most po predictors. Following our arguments from Section [21 O can be 
essentially reduced to Wl r . 

In this section we derive the upper and lower bounds for the maximal risk of the proposed MAP 
model selector fl3J) over 9JT po . 

Theorem 1 (upper bound). Let M be the solution of and 0^ be the corresponding least squares 
estimator of its coefficients. Define 0(7) = 8(7 + 3/4) 2 > 9/2 and assume that for some constant 
c> 0, 

min{m(fc)- c , e~ ck } < n(k) < m(k)e- c ^ )k (6) 

for all k = 1, . . . , r such that m(k) > 0. 

Then, there exists a constant C{^f) > depending only on 7 such that 

sup E\\X@m - Xf3 M \\ 2 < C(7)cr 2 min{max(po,lnm(p )) ,r} (7) 

simultaneously for all 1 < po < r. 

Under the conditions ([B]) on the prior n(k) in Theorem[U the corresponding penalty (0) satisfies 

f k 

Ci{~f)a 2 k <Pen(k) < C 2 (-f)a 2 max i (c + 1) In m(k) + - ln(l + 7) ; Inm(fc) + k(c + 0.51n(l + 7)) 

< £3 (7)o- 2 max {In m(k),k} , k = l,...,r (8) 

for some positive constants Ci (7), 6*2(7) an d ^3(7)- 

To assess the accuracy of the established upper bound for the quadratic risk of the proposed 
MAP estimator, we derive the lower bound for the minimax risk of estimating Xf3 in (pQ). 

The Iq quasi-norm ||/3||o of a vector (3 is defined as the number of its nonzero entries. For any 
given k = 1, . . . , r, let 4> m in[k} and 4> ma x[k] be the fc-sparse minimal and maximal eigenvalues of the 
design defined as 

, r ,, . Il^>|| 2 

9min[k\ = mm , 

/3:i<||/3||o<fe II/3II 2 

6 \k]~ max 1™£ 
^[fcj-^maoc^ ||/3||2 

In fact, m i n [fe] and 4> m ax[k] are respectively the minimal and maximal eigenvalues of all k x k 
submatrices of the matrix X'X generated by any k columns of X. Let r[k] = 4> m in[k]/4>max[k], k = 
1, . . . ,r. By the definition, r[k] is a non-increasing function of k. Obviously, r[k] < 1 and for the 
orthogonal design the equality holds for all k. 

Theorem 2 (minimax lower bound). Consider the model ([7]) and let 1 < po < r. There exists a 
universal constant C > such that 



^max{r[2p ]^g^,r[pobo} , 1 < Po < r/2 



inf sup E\\y-X(3 M \\ 2 > i L - - max U ,m P oJ ' - — j ' (9) 

y (3 M :Me<m P0 ( Co- 2 t[ Po } r, 

where the infimum is taken over all estimates y of the mean vector Xf3. 
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In some particular cases, e.g. for complete variable selection, the general minimax lower bounds 
established in Theorem [2] can be improved by removing the max(l, lnpo)-term in (Abramovich 
& Grinshtein, 2010; Rigollet & Tsybakov, 2011; Raskutti, Wainwright & Yu, 2012). Whether this 
additional log-term can be removed in the general case remains so far a conjecture. Note however 
that if lnm(po) = O(po) (in particular, for the ordered variable selection, where m(po) = 1), the 
dominating term in both bounds ([7]) and Q is anyway p$. 

The upper bound (J7|) holds for any design matrix X of rank r, while the minimax lower bound 
([9]) depends on X but only through the sparse eigenvalues ratios. Finally note that the structural 
constraints are manifested in the upper and lower bounds only through m(po) - the number of 
admissible models of size p$. 

3.2 Asymptotic adaptive minimaxity of the MAP estimator 

The established upper and lower risk bounds (J7J), ([9]) in the previous Section 13.11 is the key for 
investigating the asymptotic minimaxity of the proposed MAP estimator, where the number of 
possible predictors p = p n may increase with the sample size n. One can view such a setup as a 
series of projections of the vector X(3 on the expanding span of predictors. In particular, it may be 
p n > n or even p n 3> n. Thus, formally, we consider now a sequence of design matrices X p ^ n , where 
r n = rank(X nt p) — > oo. For simplicity of exposition, hereafter, we omit the index n. Similarly, 
there are sequences of the coefficient vectors f3„ and priors vr p (-). In these notations the original 
model ([T]) is transformed into a sequence of models 

y = X p (3 p + e, (10) 

where rank(X p ) = r and any r columns of X p are linearly independent (hence, r p [r] > 0), 
e ~ N(0, cr 2 I n ) and the noise variance a 2 does not depend on n and p. 

We consider the nearly- orthogonal design, where the sequence of sparse eigenvalues ratios r p [r] 
is bounded away from zero. Nearly-orthogonality means that there are no "too strong" linear 
relationships within any set of r columns of the design matrix X p . Evidently, in this case p cannot 
be "too large" relative to r and, therefore, to n. Indeed, Abramovich & Grinshtein (2010) showed 
that nearly-orthogonality of a design necessarily implies p = 0{r) and, thus, p = 0(n). In this 
case, 

max (lnm(po);Po) < max ( In ( ^ ] ,po I — Po(^ n (p/Po) + 1) < p = 0(r) 

V \PoJ J 

and the following two corollaries are immediate consequences of Theorems [T] and [5J 

Corollary 1 (bounds for the minimax risk). Let the design be nearly- orthogonal. There exist two 
constants < C\ < C2 < 00 such that for all sufficiently large r, 

Cia 2 max ( — ^ - - , p 1 < inf sup E\\y - X p /3 M \\ 2 < C 2 (J 2 max {lnm(p ),po} 
lmax(l,lnpo) J y (3 M -.M&m Po 
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for all 1 < po < r. 

Corollary 2 (asymptotic adaptive minimaxity of the MAP estimator). Consider the nearly- 
orthogonal design and assume that for m{k) > the prior ir(k) satisfies 

min{m(A:)- c ,e- cfc } < ?r(fc) < m{k) e - c{y)k , k = l,...,r 

for some c > and 0(7) defined in Theorem {J\ Then the corresponding MAP estimator is 
asymptotically at least nearly-minimax (up to a InpQ-factor) simultaneously over all 9Jt Po , 1 < po < 
r. 

The above general results depend on the asymptotic behavior of m(po) as a function of pq. 
Similar to Birge & Massart (2007) we consider the following three typical cases: 

1. Small numbers of admissible models: \nm(k) = 0(k), k = l,...,r — 1 (recall however 
that m{k) <(£)). 

In particular, for ordered variable selection, m(k) = 1 though "small" numbers allow also even 
exponential growth of m(k) for small and moderate k. 
For this case, Corollary Q] and Theorem [1] imply: 

Corollary 3. Consider the nearly- orthogonal design and let lnm(fc) = 0(k), k = 1, . . . , 1 — 1. As 

r increases, 

1. 

inf sup E\\y - X p (3 M \\ 2 x a 2 p (11) 

for all p = l,...,r. 

2. For m(k) > assume that e~ ck < ir(k) < m(k)e~ c ^ k , k = l,...,r for some c > 0. 
The resulting MAP estimator attains then the minimax rates simultaneously over all 
9Kpoi 1 <Po<r. 

Note that o~ 2 po is the risk of (unbiased) least squares estimation of X(3 M for a true model Mq 
of size po hi ([T]). Corollary [3] therefore verifies that when the number of admissible models is small, 
there is essentially no extra price for model selection. 

It follows from ([5]) that priors satisfying the conditions of Corollary [3] lead to the AlC-type 
penalties of the form Pen(k) ~ 2C{^f)a 2 k for some C(^y) > 1. 

2. Complete variable selection: m{k) = (?), & = 1, . . . , 1 — 1. 

As we have mentioned above, from the already known results of Abramovich <fe Grinshtein (2010), 
Rigollet & Tsybakov (2011) and Raskutti, Wainwright & Yu (2012) the max(l, lnpo)-term in the 
lower minimax risk bound can be removed in this case: 

Corollary 4. Consider complete variable selection for the nearly- orthogonal design. As r increases, 



8 



1. 

inf sup E\\y - X p (3 AI \\ 2 x a 2 p (ln(p/p ) + 1) 

y P M :M&m P0 

for all po = l,...,r. 

2. Assume that (j^ < 7r(fc) < ^ fc e e( 7 ) ) > fc = 1, . . . , r — 1 and e~ cr < ir(r) < e~ c ^ r for some 
c > 0(7). T/ie resulting MAP estimator attains then the minimax rates simultaneously 
over all Wl Po , 1 < po < 

The upper bound on 7r(fc) in Corollary 2] trivially holds for all /c < pe~ c ^\ 

Corollary [J] shows that for complete variable selection, model selection yields an additional 

multiplicative factor of ln(p/po) to the risk o~ 2 po of estimating X(3 AIo for a true model Mq of size 

Po in ©■ 

The conditions on the prior of Corollary U] hold, for example, for the truncated geometric prior 
Tv(k) oc q k , k = 1, . . . ,r for some < q < 1, and the corresponding penalties in ([5]) are of the 
2fcln(p/fc)-type, where Pen(k) = 2C(-/)a 2 k(\n(p/k) + 1)(1 + o(l)) for some (7(7) > 1. 

3. Intermediate case: k = o(\nm{k)) and m(/c) < Kj, fc = 1, . . . , r — 1. 

A practically interesting example of the intermediate case is hierarchical model selection with paired 
interactions mentioned in Section [TJ 

Example: hierarchical model selection with paired interactions. 

Consider model selection in regression with K main predictors and their paired interactions. The 
overall number of possible predictors p in © is therefore p = K + (£) = K(K + l)/2. However, 
an interaction can be included in the model only together with the corresponding main effects. 
Obviously, m(k) < (?), k = 1, . . . , r — 1. On the other hand, this is an example with "large" 
numbers of admissible models, where lnm(fc) > cfeln(p/fc) for some < c < 1. Indeed, for 
models of sizes one and two only main effects can be selected, and the numbers of admissible 
models are m(l) = K and m(2) = ( 2 ) respectively. One can trivially verify that in both cases 
lnm(A;) > ck\n(p/k), k = 1,2 for some positive constant c < 1. For k > 3, we have the following 
lemma: 

Lemma 1. For aZZ 3 < A; < r — 1, 

]nm(k) > 

For the intermediate case there is the lnpo _ g a P between the upper bound for the quadratic risk 
of the MAP estimator in Theorem [1] and the minimax lower bound in Theorem [2j So far, we can 
claim that if m(k)~ c < ir(k) < m(k)e~ c<a ^ k \ k = l,...,r and the design is nearly-orthogonal, 
the resulting MAP estimator is asymptotically at least nearly-minimax (up to hipQ-i&ctor) over 



\n(p/k) 
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all 9K po , 1 < po 5; r aR d can only conjecture that, similar to the complete variable selection, this 
log-factor can be removed in this case as well. 

Similar to complete variable selection, when the numbers of admissible models for the 
intermediate case are "large" (e.g., hierarchical model selection with main effects and paired 
interactions considered above), the conditions on the prior in Corollary [2] are satisfied for the 
truncated geometric prior n(k) oc q k , k = 1, . . . , r for some < q < 1 corresponding to complexity 
penalties of 2k ln(p/fc)-type. 

Note also that for the nearly-orthogonal design, \\X p p ^ — X p (3 p \\ X \\(3 p fy — P p \\ and all the 
results of this section for estimating the mean vector X P (3 M can be therefore straightforwardly 
applied to estimating the regression coefficients (3 M . 

4 Simulation study 

We conducted a short simulation study to demonstrate the performance of the proposed MAP 
model selection procedure. We considered polynomial regression which is an example of the ordered 
variable selection (see Section [[]): 



where < k < n — lis the polynomial degree to be selected, and ~ A/"(0, a 2 ) with the known 
variance <r 2 and independent. In this case obviously m(k) = 1 for all k = 0, . . . , n — 1. An example 
of a prior satisfying the conditions of Corollary [3] is the (truncated) geometric prior Geom(l — q), 
where ir(k) = (1 — q)q k /(l — q n ) oc q k , k = 0, . . . , n — 1 for some < q < 1. The corresponding 
complexity penalty (O is the AlC-type linear penalty 



4.1 Estimation of parameters 

To apply the developed MAP model selection procedure we need to specify the prior parameters 
7 and q in (|12p . They are rarely known a priori in practice and usually should be estimated from 
the data. 

Let X nxn be the design matrix of the saturated model, that is, Xij = xj , i = 1, . . . ,n;j = 
0, . . . ,n — 1. For a given k, define the corresponding diagonal indicator matrix = diag(dj;), 
where d^j = 1, j = 1,. . . ,k and zero otherwise (see Section [2|). Based on the proposed Bayesian 
model from Section [2J 



Vi = A) + PiXi + • • • + PkX k + €i, i = 1, . . . 



n 




(12) 



y|A;~AA(0,cT 2 (/ + 7^)) 
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where = XD^D^X 1 XD^) + D^X' , and straightforward calculus yields the following marginal 
likelihood of the observed data y: 

My; i,Q) = y. hT =, ex P < ir- 2 \ *w 

— r 7 y^y) (l-g)g k 



° C 5 (l+7) fc/2eXP l7+l 2a2 / 1 

The MLEs for 7 and q can be obtained (numerically) by the EM algorithm. Regard k as a 
"missing" data and define the corresponding latent indicator variables Uj = 5jk, j = 0, . . . , n — 1. 
The complete log-likelihood for the "augmented" data (y, u), up to an additive constant, is then 

n—1 1 tt i (-1 I ^ n—1 n—1 

l(y, U ; 7, <?) = ^ ^ Uk ^^f ~ 2 + In g J> fc * + ln(l - 9 ) - - q n ) 

~ fc=0 ° k=0 k=0 

On the E-step at the h-th. iteration we compute the conditional expectation 
m = E {l(y,vL; 7) q)\y,^ h \ q ^ 

\^4h]y'H k y ln(l + 7)^. M v^[ftk , , /-, \ 1 m n\ ^ 13 ^ 

k=0 k=0 k=0 



_1_ 

7+ 



where 



L,j=Q eX P / ^ l JJ 

At the M-step we maximize w.r.t. 7 and g to get 

*[a+i] = ( Efc=d 4 fe V /ff fcy _ ^ 

There is no closed form solution for 5^+1] _ However, replacing the truncated geometric distribution 
by a usual one and ignoring thus the last term ln(l — q n ) in the RHS of (|13p . implies a good 
approximation of q^ l+1 ^ for large n: 

4.2 The results 

We used two test functions: a fifth degree polynomial = (x + 0.1)(x — 0.2) (a: — 0.4) (a; — 



0.8) (x — 1.1) and the Doppler function g^(x) = ^/x(l — x) sin(27r • 1.05/(x + 0.05)) (see Donoho 
& Johnstone, 1994) which does not have a sparse polynomial approximation. Both functions were 
then normalized to have unit M[0, l]-norms. The data were generated according to the model 

Ui = gi,i{i/n) + a, i = l,...,n 
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for n = 100, where ~ A/"(0, a 2 ) and independent. The noise variance a 2 was chosen to ensure 
the signal-to-noise ratio SNR at levels 3, 5 and 7 and was assumed to be known. The number of 
replications was 100. 

The proposed MAP model selector results in a model selection procedure with a linear type 
penalty of the form Pen(k) = 2a 2 Xk with \ MAP = (1 + I/7) kifa^y/T+j) (see ([12]) ). We 
compared it with two other well-known model selection procedures with linear penalties, namely, 
AIC (Xaic = 1) an d RIC (Xric = m P) (see Section[T]). In our case, Xric = In 100 = 4.6. 

Table [T] summarizes mean squared errors averaged over 100 replications (AMSE). We also 
present the average polynomial degrees selected by the three methods for approximating the true 
response functions. 

Table 1: AMSEs and polynomial degrees (in parentheses) averaged over 100 replications for three 
estimators 





91 


92 


SNR 


MAP 


AIC 


RIC 


MAP 


AIC 


RIC 


3 


0.661 


0.813 


0.652 


28.844 


27.818 


35.188 




(5.01) 


(5.50) 


(5.00) 


(29.07) 


(33.66) 


(18.11) 


5 


0.238 


0.293 


0.235 


24.677 


23.450 


27.056 




(5.01) 


(5.50) 


(5.00) 


(44.13) 


(53.48) 


(29.40) 


7 


0.121 


0.149 


0.120 


22.995 


22.698 


26.036 




(5.01) 


(5.50) 


(5.00) 


(52.72) 


(58.57) 


(32.21) 



As expected, a more conservative RIC tends to include less terms in the model and outperforms 
AIC for a polynomial g±, while the latter is superior for g2, where a high order polynomial 
approximation is required. The MAP estimator with estimated 7 and q yields a data-driven A 
and is adaptive to the unknown polynomial degree - it behaves very similar to RIC when it is low 
and to AIC when it is high. 

5 Concluding remarks 

In this paper we considered model selection in linear regression under general structural constraints 
and extended the existing results for the complete variable selection. In particular, we utilized a 
Bayesian MAP model selection procedure of Abramovich & Grinshtein (2010) and modified it 
correspondingly. From a frequentist view, the resulting MAP model selector is a penalized least 
squares estimator with a complexity penalty associated with a prior on the model size which is 
adaptive to the structural constraints. In fact, the proposed Bayesian approach can be used as 
a tool for generating a wide class of penalized least squares estimators with various complexity 



12 



penalties. 

We established the general upper bound for the quadratic risk of the MAP estimator over a set of 
admissible models of a given size and the lower bound for the corresponding minimax risk. Based on 
these results, we showed that for the nearly-orthogonal design, the MAP estimator is asymptotically 
at least nearly minimax (up to a log-factor) for the entire range of sparse and dense models. 
Moreover, when the numbers of admissible models are "small" or, on the opposite, for the case of 
complete variable selection, it achieves the exact minimax rates. The corresponding MAP model 
selection procedures lead respectively to AlC-type and 2k ln(p/A;)-type criteria. Whether these 
results on asymptotic minimaxity are true for the intermediate case remains so far a conjecture. 

There are also other challenges for future research. The assumption of nearly-orthogonal design 
used in investigating asymptotic minimaxity of MAP estimators in Section 13.21 typically does not 
hold for p 3> n setup due to the multicollinearity phenomenon. The analysis of multicollinear 
design, where the sequence of sparse eigenvalues ratios T p [r] may tend to zero as p increases is 
much more delicate. In this case there is a gap (in addition to a log-factor) between the rates in 
the upper and lower bounds (J7J) and ([9]). Unlike model identification or coefficients estimation, 
where multicollinearity is a "curse" , it may essentially become a "blessing" for estimating the mean 
vector allowing one to exploit correlations between predictors to reduce the size of a model (hence, 
to decrease the variance) without paying much extra price in the bias term. Interestingly, a similar 
phenomenon also occurs in a testing setup (e.g., Hall &: Jin, 2010). Abramovich &: Grinshtein 
(2010) investigated the complete variable selection for multicollinear design and showed that under 
certain additional assumptions on the design and the regression coefficients, the MAP estimator 
corresponding to the 2k ln(p/fc)-type complexity penalty remains asymptotically rate-optimal (in 
the minimax sense) even for this case. Whether it is true and what are the additional conditions 
in the presence of structural constraints is a challenging topic for future research. 

Computational issues are another important problem. When the numbers of admissible models 
are large (e.g., complete variable selection or hierarchical model selection with interactions), 
minimizing ([2]) (and (j3|) in particular) requires generally an NP-hard combinatorial search. During 
the last decade there have been substantial efforts to develop various approximated algorithms for 
solving ^) that are computationally feasible for high-dimensional data (see, e.g. Tropp & Wright, 
2010 for a survey and references therein). The common remedies involve either greedy algorithms 
(e.g., forward selection, matching pursuit) approximating the global solution by a stepwise sequence 
of local ones, or convex relaxation methods replacing the original combinatorial problem by a related 
convex program (e.g., Lasso and Dantzig selector for linear penalties). Abramovich &; Grinshtein 
(2010) proposed to utilize the developed Bayesian formalism for solving ([3D by using a stochastic 
search variable selection (SSVS) techniques originated in George & McCullogh (1993, 1997). The 
underlying idea of SSVS is based on generating a sequence of models from the posterior distribution 
P(M\y) in @. The key point is that the relevant models with the highest posterior probabilities will 
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appear most frequently and can be identified even for a generated sample of a relatively small size 
avoiding computations of the entire posterior distribution. However, most of the above approaches 
have been developed and studied for complete variable selection. Their adaptation to minimization 
of subject to the additional structural constraints while remaining computationally feasible 
should depend on the specific type of constraints at hand. In particular, for somewhat different 
priors, Chipman (1996) and Farcomeni (2010) considered SVSS for hierarchical model selection 
in regression with paired interactions (see the example in Section [3]) and for model selection with 
factor predictors, where the corresponding dummy variables are all included or excluded. Bien, 
Taylor & Tibshirani (2012) modified Lasso for hierarchical paired interactions (see also references 
therein). However, to the best of our knowledge, there are no theoretical results on the optimality 
of the resulting estimators. 

Finally, we should note that the obtained theoretical results assume that the noise variance a 2 
is known which is rarely the case in practical applications. One can estimate a 2 from the data with 
the additional tuning parameters of the MAP procedure (see Section |4.1|) . Alternatively, he can 
follow the fully Bayesian approach and impose some prior distribution on it (see, e.g., Chipman, 
1996; Chipman, George & McCulloch, 2001 and Farcomeni, 2010). 

Acknowledgments. Both authors were supported by the Israel Science Foundation grant ISF- 
248/08. 

6 Appendix 

Throughout the proofs we use C to denote a generic positive constant, not necessarily the same 
each time it is used, even within a single equation. 

6.1 Proof of Theorem [j] 

We first show that under the conditions on a prior in Theorem[TJ the corresponding penalty Pen(k) 
in ([5]) belongs to the class of penalties considered in Birge &: Massart (2001) and then use their 
Theorem 2 to establish the general upper bound for the quadratic risk of the MAP estimator @. 
Define 

L k = ^-\n(m(k)n-\k)) , k = l,...,r, (14) 

where under the conditions on the prior 7r(-) in Theorem[U > 0(7). In terms of the complexity 
penalty is Pen(k) = a 2 {\ + l/j)k(2L k + ln(l + 7)). 

In our notations the conditions (3.3) and (3.4) on in Theorem 2 of Birge & Massart (2001) 
correspond respectively to 

r 

J2m{k)e- kLk < c (15) 

k=l 
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and 

(l + l/ 7 )(2L fc +ln(l+ 7 )) > C(l + 72i;) 2 , k = l,...,r (16) 

for some C > 1. 

The condition ()15p follows immediately from the definition of L k : 

r r 

^ m{k)e- kLk =J2Tr(k) = 1 - tt(0) < 1 
fc=l fe=l 

Consider now (fT6|) which is equivalent to the inequality 

2(1 + l/ 7 - C)L k - 2C^2T k + (1 + l/ 7 ) ln(l + 7) - C > (17) 

Repeating the calculus in the proof of Theorem 1 of Abramovich, Grinshtein & Pensky (2007), one 
verifies that with the upper bound on the prior ir(k) in ©, for C = 1 + 1/(27), (H3) an d therefore 
(I16[) are satisfied for all L k , k = 1, . . . , r. 

Given (|15p - (|16p . Theorem 2 of Birge & Massart (2001) yields the following upper bound for the 
quadratic risk of the MAP estimator @ of the mean vector X(3 in ([1]): 

EWX^ - X(3\\ 2 < c ( 7 ) M {\\X(3 M - X(3\\ 2 + Pen(\M\)} + Ci ( 7 )ct 2 (18) 

for some 00(7) and 01(7) depending only on 7, where Pen(\M\) is given in ([5]). 
Recall that m(r) = 1. Then, using the lower bound on 7r(r), (|18p and (|8j) imply 

sup £||X^-X/3 M || 2 < sup E\\XP£-Xj3 M \\ 2 < c ("/)Pen(r) + ^(7)^ 
P M :Me<m P0 (3 M :Men r 

< C{j)a 2 r (19) 

for all 1 < po < r - 

On the other hand, for po < r from (fT8|) and (jSJ) we have 

sup A/3 M || 2 < co(7)Pen(p ) + ci( 7 )cr 2 < C( 7 )ct 2 max{lnm(p ),p } (20) 

(3 M :MeWlp 

Combining (|19p and (|20|) completes the proof. 
□ 

6.2 Proof of Theorem H 

We first show that 

inf sup E||y- A/3 M || 2 > r[p ]a 2 p (21) 

y /3 M :A/e0R PO 

for all 1 < po < r. 
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Consider the original regression model ([T]) and assume that the true coefficients vector f3 M G 
Wl po . Define Xm = XDm , where the p x p diagonal indicator matrix Dm was defined in Section 
[2j In the coefficients domain one then has 

w = p M + e', (22) 

where w = (X' M X M )+X' M y and e' ~ Af(0, a 2 (X' M X M )+). 

Let R(Wl pa , ct 2 (X m Xm) + ) be the minimax risk of estimating (3 M in ([22]) over Tl po , that is, 

R{m po ,a 2 {X ! M X M ) + ) =inf sup E||£-/3 A/ || 2 

/3 f3 M :M&m P0 

Evidently, 

inf sup E||y-X/3 A/ || 2 > min [po] i?(9?l P0 , a 2 (X; // X M )+) (23) 
y f3 M -.Me<M P0 

Consider also the model ([22]) but with the uncorrelated noise e" ~ JV(0, a 2 4>^- ax \pQ\D m)'- 

w = /3 M + e" (24) 

and the corresponding minimax risk R(^JJl Po ,a 2 4>^ ax \po]DM)- 

Since (X' m Xm) + > 4>max\Po]DM in the usual sense that (X' m Xm) + — 4>^rtax [po]Dm is positive 
semi-definite, 

R(m po , a 2 {X' M X M ) + ) > R(m po ,a 2 ^ l ax [po\D M ) (25) 

(see, e.g., Lemma 4.27 of Johnstone, 2011). 

No estimator of (3 M in (|24p obviously cannot outperform the oracle estimator that knows the 
true /3 M G Wl Po whose ideal minimal quadratic risk is Y^j=i m ^ n (^M j ' a2 ^maxlPo]) ( e -g-> Donoho & 
Johnstone, 1994). Hence, 

v 

R(^ Pi) ,o 2 4)^ ax \p G \D M )> sup ^mm(f3l Id ,a 2 (f)-\ x [po}) > ^(pmaxipo] Po (26) 

(3 M :M&m m j=l 

Combining $Z5§ and (gHJ) implies (12TD . 

Consider now 1 < p$ < r/2 and show that in this case, in addition to the lower bound (|21[) . 

inf sup E\\y-XP M \\ 2 >Ca 2 r[2p }- lnm{Po) 



y /3 A/ :A/es« P0 max(l,lnp ) 
Let 

^( W) = C.V[2 W ] ■""<»> (27) 
max(l, mpoj 

The core of the proof is to find a subset £> Po of vectors (3 M , M G 9JT Po and the corresponding subset of 
mean vectors Q po = {g € E n : g = X(3 M , f3 M G £ po } such that for any gi, g 2 G g po , ||gi - g 2 || 2 > 
4s 2 (p ) and the Kullback-Leibler divergence if(P gl ,P g2 ) = |gl 2o g 2112 < (1/16) In card(£ po ). Lemma 
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A.l Bunea, Tsybakov & Wegkamp (2007) will imply then that s 2 (po) is the minimax lower bound 
over %Jl P0 . 

The standard techniques for constructing such sets of vectors for the complete variable selection 
setup is based on generalizations of Varshamov- Gilbert bound (e.g, Abramovich & Grinshtein, 2010; 
Rigollet & Tsybakov, 2011; Raskutti, Wainwright & Yu, 2012). Unfortunately, it cannot be applied 
when there are additional structural constraints on the set of admissible models. We utilize instead 
the recent combinatorial results of Gutin & Jones (2012). 

Define the subset B Po of all vectors f3 M ,M G M Po C Tl Po that have po entries equal to C Po 
defined later, while the remaining entries are zeros: B Po = {@m : Pm ^ {{0,C P0 } P }, M G -M po }. 
Let p{(3 1m ,P 2 m) = Y$=i^{PiM,j P2M,j} be the Hamming distance between f3 1M , (3 2M S B po 
and define 



Pmax= max _ p(f3 1M ,(3 2M ) and p min = min _ p((3 1M ,(3 2M ) 

Theorem 2 of Gutin & Jones (2012) ensures that for any constant C > 2 there exists a subset 
B Pa C B Po such that 

Pmax ^ q lncard(B P0 ) > alnm(po)) 

Pmin 

where 

= r in(pp/2) - 
a ln(C/2) 

Consider the corresponding subset of mean vectors Q Po , where card(<5 Po ) = card(^ Po ). For any 
gij g2 S Gp and the associated with them f3 1M , (3 2M G B Po we then have 

||gi - g 2 || 2 = \\X(J3 1M - (3 2M )\\ 2 > ^ mi „[2p ] \\(3 1M - p 2M \\ 2 > (f>rmn[2 P0 ]Clp min (28) 



On the other hand, by similar arguments, the Kullback-Leibler divergence satisfies 

*>max [2go] C 2 p(f3 1M ,P 2M ) (f) ma x[2po]C po Pmax 

} ~ 2a 2 - 2a 2 



K(F gl , P g2 ) < 1 ^-Wrv-^^^/ < rn^ r^~por^ 

Set now 



2 _ o- 2 aln?n(p ) 



^ &Pma,x4 > max 



[2po] 

Then, (HHD and ([29]) yield ||gi-g 2 || 2 > T[2p ]tf 2 alnm(po)/(8C), K(P Sl ,F B3 ) < (1/16) In card(g po ), 



and Lemma A.l of Bunea, Tsybakov &: Wegkamp (2007) with s (po) from (|27p completes the proof. 
□ 



6.3 Proof of Lemma [T] 

1. 3 < k < \K 

Consider first all models that include any [f J paired interactions and the corresponding main 
effects. Evidently, the size of any such a model is at most k. If it is less than k (it happens when 



17 



the same main effect appears in several interactions), we complete it to k by adding other main 
effects from the remaining ones (one can easily verify that there are enough remaining main effects 
since k — [f J < K). We then have 

m(k) > 

and, therefore, by straightforward calculus, 



K(K-l) 
2 



lnra(fc) > 



> 


k 


^(^ _1) )> 


k 


In ^ 




3 


v 2 LiJ )~ 


3 





'K{K + l) 
2k 



\n(p/k) 



2. pC < jfe <r- 1 

For this case we consider models of size k that include all K main effects and any k — K paired 
interactions. Thus, 

/ K(K—1) 

•n(k)>( k : K 



and 



In ,n(k) >(*-*) In > h - In ( 

KJ ~ K > \2(k-K))-3 \2(k-K) 



To complete the proof one can easily verify that > K ^k~^ ■ 



□ 



References 

[1] Abramovich, F., Benjamini, Y., Donoho, D.L. and Johnstone, I.M. (2006). Adapting 
to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34, 584-653. 

[2] Abramovich, F. and Grinshtein, V. (2010). MAP model selection in Gaussian regression. 
Electr. J. Statist. 4, 932-949. 

[3] Abramovich, F., Grinshtein, V. and Pensky, M. (2007). On optimality of Bayesian 
testimation in the normal means problem. Ann. Statist. 35, 2261-2286. 

[4] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, 
in Second International Symposium on Information Theory, (eds. B.N. Petrov and F. Czaki). 
Akademiai Kiado, Budapest, 267-281. 

[5] Bien, J., Taylor, J. and Tibshirani, R. (2012). A Lasso for hierarchical interactions. 
http://arxiv.org/pdf/1205.5050 



[6] Birge, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3, 203-268. 



18 



[7] Birge, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. 
Theory Relat. Fields 138, 33-73. 

[8] Bunea, F., Tsybakov, A. and Wegkamp, M.H. (2007). Aggregation for Gaussian 
regression. Ann. Statist. 35, 1674-1697. 

[9] Chipman, H. (1996). Bayesian variable selection with related predictors. Canad. J. Statist. 
24, 17-36. 

[10] Chipman, H., George, E.I. and McCullogh, R.E. (2001). The Practical Implementation 
of Bayesian Model Selection. IMS Lecture Notes - Monograph Series 38. 

[11] Donoho, D. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. 
Biometrika 81, 425-455. 

[12] Farcomeni, A. (2010). Bayesian constrained variable selection. Statistica Sinica 20, 1043- 
1062. 

[13] George, E.I. and McCullogh, R.E. (1993). Variable selection via Gibbs sampling. J. Am. 
Statist. Assoc. 88, 881-889. 

[14] George, E.I. and McCullogh, R.E. (1997). Approaches to Bayesian variable selection. 
Statistica Sinica 7, 339-373. 

[15] Gutin, G. and Jones, M. (2012). Note on large subsets of binary vecrtors with similar 
distances. SIAM J. Discrete Math. 26, 1108-1111. 

[16] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in 
correlated noise. Ann. Statist. 38, 1681-1732. 

[17] Johnstone, I.M. (2011). Function Estimation and Gaussian Sequence Models, unpublished 
manuscript. 

[18] LOUNICI, K., Pontil, M., Tsybakov, A. and van de Geer, S. (2011). Oracle inequalities 
and optimal inference under group sparsity. Ann. Statist. 39, 2164-2204. 

[19] Mallows, C.L. (1973). Some comments on C p . Technometrics 15, 661-675. 

[20] Raskutti, G., Wainwright, M.J. and Yu, B. (2012). Minimax rates of estimation for 
high-dimensional regression over l q balls. IEEE Trans. Inf. Theory, to appear. 

[21] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse 
estimation. Ann. Statist. 39, 731-771. 

[22] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464. 



19 



[23] Tropp, J. A. and Wright, S.J. (2010). Computational methods for sparse solution of 
linear inverse problems. Proc. IEEE, special issue "Applications of sparse representation and 
compressive sensing". 

[24] Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g- 
prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno 
de Finetti (eds. Goel, P.K. and Zellner, A.), North-Holland, Amsterdam, 233-243. 



20 



